BUG/PERF: SeriesGroupBy.value_counts sorting bug and categorical performance #50548

Merged 10 commits on Jan 11, 2023

rhshadrach
Member

This takes a bit of a perf hit with SeriesGroupBy.value_counts on object dtype (1.32 ms on main vs. 2 ms with this PR in the timings below); I think this is because the current implementation always sorts the groupers (which is also fixed here).

# group boundaries are where group ids change
idchanges = 1 + np.nonzero(ids[1:] != ids[:-1])[0]
idx = np.r_[0, idchanges]
if not len(ids):
    idx = idchanges
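
To illustrate the boundary computation quoted above, here is a minimal sketch. The helper name `group_boundaries` and the sample arrays are hypothetical and not taken from the pandas source. It suggests why this approach presumably needs `ids` to be sorted: every change in id is treated as the start of a new group, so unsorted ids produce more boundaries than there are distinct groups.

```python
import numpy as np


def group_boundaries(ids: np.ndarray) -> np.ndarray:
    """Return the start index of each run of equal ids (hypothetical helper)."""
    # Positions where an id differs from its predecessor start a new run.
    idchanges = 1 + np.nonzero(ids[1:] != ids[:-1])[0]
    if not len(ids):
        # No rows, so there are no runs at all.
        return idchanges
    # The first run always starts at position 0.
    return np.r_[0, idchanges]


sorted_ids = np.array([0, 0, 1, 1, 1, 2])
unsorted_ids = np.array([0, 1, 0, 2, 2, 1])

print(group_boundaries(sorted_ids))    # one run per group
print(group_boundaries(unsorted_ids))  # more runs than distinct groups
```

With sorted ids the number of runs equals the number of groups; with unsorted ids it does not, which is consistent with the existing code sorting the groupers first.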

The categorical case also gets a good perf improvement (SeriesGroupBy: 654 ms on main vs. 83.1 ms with this PR in the timings below).

Perf code
import numpy as np
import pandas as pd
import time

size = 1000
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]
col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)
col3_values = np.random.choice(col2_possible_values, size=size, replace=True)
df = pd.DataFrame(zip(col1_values, col2_values, col3_values), columns=["col1", "col2", "col3"])

print('DataFrameGroupBy - object')
%timeit df.groupby("col1").value_counts()

print('SeriesGroupBy - object')
%timeit df.groupby("col1")["col2"].value_counts()

df['col2'] = df['col2'].astype('category')

print('DataFrameGroupBy - category')
%timeit df.groupby("col1", observed=True)[["col2"]].value_counts()

print('SeriesGroupBy - category')
%timeit df.groupby("col1", observed=True)["col2"].value_counts()

# main
DataFrameGroupBy - object
4.09 ms ± 97.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
SeriesGroupBy - object
1.32 ms ± 7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
DataFrameGroupBy - category
86.1 ms ± 611 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
SeriesGroupBy - category
654 ms ± 7.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# This PR
DataFrameGroupBy - object
4.1 ms ± 53.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
SeriesGroupBy - object
2 ms ± 17.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
DataFrameGroupBy - category
84.7 ms ± 576 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
SeriesGroupBy - category
83.1 ms ± 565 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
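
The `%timeit` lines in the perf code are IPython magics, so that snippet will not run as a plain Python script. Below is a rough equivalent using the standard-library `timeit` module. The helper `best_ms`, the seed, and the much smaller value pools are assumptions made for illustration, so the numbers it prints are not comparable to the timings reported above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
size = 1000
# Smaller, hypothetical value pools than in the original perf code.
letters = list("ABCDEFGHIJ")
pool1 = ["".join(rng.choice(letters, 5)) for _ in range(200)]
pool2 = ["".join(rng.choice(letters, 3)) for _ in range(50)]
df = pd.DataFrame(
    {"col1": rng.choice(pool1, size=size), "col2": rng.choice(pool2, size=size)}
)


def best_ms(func, repeat=3, number=10):
    """Best per-call time in milliseconds over `repeat` rounds (hypothetical helper)."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number * 1e3


timings = {}
timings["SeriesGroupBy - object"] = best_ms(
    lambda: df.groupby("col1")["col2"].value_counts()
)

# Convert the counted column to categorical and time the same call again.
df["col2"] = df["col2"].astype("category")
timings["SeriesGroupBy - category"] = best_ms(
    lambda: df.groupby("col1", observed=True)["col2"].value_counts()
)

for label, ms in timings.items():
    print(f"{label}: {ms:.2f} ms")
```

Running the script once on main and once on the PR branch would give a before/after comparison similar in spirit to the one above.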

@rhshadrach added the Bug, Groupby, Performance, and Algos labels on Jan 4, 2023
@rhshadrach added this to the 2.0 milestone on Jan 4, 2023
Member

@mroeschke mroeschke left a comment

Looks fairly good. Just a merge conflict

Member

@mroeschke mroeschke left a comment

LGTM after the merge conflicts are resolved

…pby_value_counts_sort

# Conflicts:
#	doc/source/whatsnew/v2.0.0.rst
#	pandas/tests/groupby/test_value_counts.py
…rach/pandas into groupby_value_counts_sort

# Conflicts:
#	doc/source/whatsnew/v2.0.0.rst
@rhshadrach
Member Author

Merging to avoid any further whatsnew conflicts

@rhshadrach rhshadrach merged commit 4f42ecb into pandas-dev:main Jan 11, 2023
@rhshadrach rhshadrach deleted the groupby_value_counts_sort branch January 11, 2023 02:22